Forschungspraktikum 1+2: Computational Social Science

Session 07: Topic Modeling

Dr. Christian Czymara

Agenda

  • What are topic models and what do you use them for?
  • The topic modeling algorithm
  • Estimating effects of document-level covariates
  • Choosing the number of topics
  • Validation
  • Tutorial: Identifying topics in news articles

Recap: Unsupervised Machine Learning

  • Inductive approach: No predefined categories needed
  • Finding unexpected patterns or new insights in data
  • Suited for explorative research questions
  • Less suitable for confirmatory research questions (hypothesis testing)

Topic Modeling

Topic Models

  • Topic modeling is a family of algorithms for finding the most important themes (topics) in a large text collection (corpus)
  • Requires little prior knowledge
  • Answers questions such as:
    • What topics are being discussed?
    • How frequently are these topics addressed?

Latent Dirichlet Allocation

  • “Classical” algorithm: Latent Dirichlet Allocation (LDA)
    • Assigning topics to documents (\(\theta\), theta)
    • Assigning words to topics (\(\phi\), phi)
  • Two steps:
    1. Identifying co-occurring words
    2. Analyzing how these words are distributed across texts
  • Unsupervised machine learning, no labeled data required

Document-Topic Distribution (\(\theta\))

  • Each document has its own unique distribution
  • \(\theta_d\) is sampled from a Dirichlet distribution with parameter \(\alpha\), which controls how evenly topics are distributed within each document:
    • Higher \(\alpha\): Documents are likely to cover multiple topics evenly
    • Lower \(\alpha\): Documents tend to focus on a smaller set of topics
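
The effect of \(\alpha\) can be checked by simulation. The following is a minimal sketch in base R, assuming a symmetric Dirichlet prior; the helper rdirichlet_sym() and the values 0.1 and 10 are illustrative choices and not part of any package.

```r
# Draw n samples from a symmetric Dirichlet with k components
# by normalizing independent gamma draws (base R only; hypothetical helper).
rdirichlet_sym <- function(n, k, alpha) {
  g <- matrix(rgamma(n * k, shape = alpha), nrow = n, ncol = k)
  g / rowSums(g)
}

set.seed(42)
theta_low  <- rdirichlet_sym(1000, k = 5, alpha = 0.1)  # low alpha
theta_high <- rdirichlet_sym(1000, k = 5, alpha = 10)   # high alpha

# Average share of the single largest topic per simulated document
mean(apply(theta_low, 1, max))   # close to 1: documents focus on few topics
mean(apply(theta_high, 1, max))  # much smaller: topics are spread evenly
```

The same logic applies to \(\beta\) and the topic-word distribution \(\phi\).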

How They Work Together in LDA

  • Both distributions are iteratively learned during training
  • A document is generated by choosing a mix of topics (\(\theta\))
  • Each word in the document is drawn by:
    • Selecting a topic from the document’s \(\theta\)
    • Picking a word from the topic’s \(\phi\)
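
The generative story above can be sketched in a few lines of base R. The vocabulary, the two topics, and all probabilities below are made-up illustration values, not estimates from any data.

```r
set.seed(1)
vocab <- c("election", "government", "policy", "market", "price", "tax")

# Hypothetical topic-word distributions (phi): one row per topic, rows sum to 1
phi <- rbind(
  politics = c(0.30, 0.25, 0.20, 0.05, 0.05, 0.15),
  economy  = c(0.05, 0.10, 0.10, 0.30, 0.25, 0.20)
)
colnames(phi) <- vocab

# Hypothetical document-topic distribution (theta) for one document
theta <- c(politics = 0.8, economy = 0.2)

n_words <- 10
# Step 1: select a topic for each word position from the document's theta
z <- sample(rownames(phi), n_words, replace = TRUE, prob = theta)
# Step 2: pick each word from the phi of the topic selected for that position
w <- vapply(z, function(k) sample(vocab, 1, prob = phi[k, ]),
            character(1), USE.NAMES = FALSE)

data.frame(topic = z, word = w)
```

Estimation runs this story in reverse: only the words are observed, and \(\theta\) and \(\phi\) are inferred from them.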

Topic-Word Distribution (\(\phi\))

  • Probability distribution over words for each topic
  • Specifying the likelihood of each word being associated with a specific topic
  • \(\phi_k = [P(W_1 | T_k), P(W_2 | T_k), \ldots, P(W_v | T_k)]\)
    • \(W_i\): Word \(i\)
    • \(T_k\): The given topic \(k\)
    • \(v\): Vocabulary size

Topic-Word Distribution (\(\phi\))

  • Each topic has its own unique distribution
  • \(\phi_k\) is sampled from a Dirichlet distribution with parameter \(\beta\), which controls how evenly words are distributed within topics:
    • Higher \(\beta\): Topics are likely to include many words evenly
    • Lower \(\beta\): Topics are likely to focus on a few high-probability words

Document-Topic Distribution (\(\theta\))

  • Probability distribution over topics for each document
  • Specifying the likelihood of a document being associated with each topic
  • \(\theta_d = [P(T_1 | D), P(T_2 | D), \ldots, P(T_K | D)]\)
    • \(T_i\): Topic \(i\)
    • \(D\): The given document
    • \(K\): Total number of topics

Simple Example

  • Document A: “The election results are influenced by government policies.”
  • \(\theta\): 80% Politics, 20% Economy
  • \(\phi\) (Politics):
    • “election”: 30%
    • “government”: 25%
    • “policy”: 20%
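
Read as a mixture, the probability of a word in a document is the sum over topics of the topic share times the topic's word probability. Only the Politics row of \(\phi\) is given in the example, so only the Politics contribution to “election” can be computed; the Economy contribution would require values that are not stated here.

\[
P(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\, \phi_{k,w}
\]

\[
\theta_{A,\text{Politics}} \cdot \phi_{\text{Politics},\text{election}} = 0.80 \times 0.30 = 0.24
\]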

Assumptions Underlying Topic Models

  • Bag-of-words: Word order does not matter
  • Mixed membership: Documents can include multiple topics
  • Similar topics tend to share similar words
  • Some words are more characteristic of certain topics

The Algorithm

  • Iterative process that pursues two goals simultaneously:
    • Words that occur together frequently are more likely to belong to the same topic
    • Words in the same document are more likely to belong to the same topic
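
One way to make the two goals concrete is collapsed Gibbs sampling, a common inference scheme for LDA. The sketch below is a toy illustration in base R under assumed values (a tiny made-up corpus, K = 2, alpha = 0.1, beta = 0.01); it is not necessarily how any particular package estimates its models.

```r
# Toy corpus: each document is a vector of word ids into `vocab` (made-up data)
vocab <- c("election", "government", "policy", "market", "price", "tax")
docs <- list(c(1, 2, 3, 1, 2), c(4, 5, 6, 4, 5), c(1, 3, 2, 6), c(5, 4, 6, 3))
K <- 2; V <- length(vocab); alpha <- 0.1; beta <- 0.01
set.seed(123)

# Random initial topic assignment for every token
z <- lapply(docs, function(d) sample.int(K, length(d), replace = TRUE))

# Count matrices: documents x topics (ndk) and topics x words (nkw)
ndk <- matrix(0, length(docs), K)
nkw <- matrix(0, K, V)
for (d in seq_along(docs)) {
  for (i in seq_along(docs[[d]])) {
    ndk[d, z[[d]][i]] <- ndk[d, z[[d]][i]] + 1
    nkw[z[[d]][i], docs[[d]][i]] <- nkw[z[[d]][i], docs[[d]][i]] + 1
  }
}

for (iter in 1:200) {
  for (d in seq_along(docs)) {
    for (i in seq_along(docs[[d]])) {
      w <- docs[[d]][i]
      k_old <- z[[d]][i]
      # Remove the token's current assignment from the counts
      ndk[d, k_old] <- ndk[d, k_old] - 1
      nkw[k_old, w] <- nkw[k_old, w] - 1
      # First factor: topics already frequent in this document are favored.
      # Second factor: topics that already use this word often are favored.
      p <- (ndk[d, ] + alpha) * (nkw[, w] + beta) / (rowSums(nkw) + V * beta)
      k_new <- sample.int(K, 1, prob = p)
      z[[d]][i] <- k_new
      ndk[d, k_new] <- ndk[d, k_new] + 1
      nkw[k_new, w] <- nkw[k_new, w] + 1
    }
  }
}

# Point estimates of both distributions from the final counts
theta <- (ndk + alpha) / rowSums(ndk + alpha)  # documents x topics
phi   <- (nkw + beta) / rowSums(nkw + beta)    # topics x words
colnames(phi) <- vocab
round(theta, 2)
round(phi, 2)
```

The two factors in `p` correspond to the two goals: the document count keeps words of one document in the same topic, and the word count keeps co-occurring words in the same topic.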

Topic Models in Action

Kling (2016): Topic Modelling Portal

Structural Topic Models (STM)

  • Extends LDA by incorporating metadata like author demographics, time, or location
  • Again, two distributions:
    • \(\theta\): Document-topic distribution influenced by metadata
    • \(\beta\): Topic-word distribution conditioned on metadata (analogous to \(\phi\) in LDA)
    • Notation differs from LDA: the role of LDA's \(\alpha\) prior is taken by \(\gamma\) in STM and that of LDA's \(\beta\) prior by \(\sigma\); STM's \(\beta\) is the topic-word distribution itself

Applying STM

  • stm package for R by Roberts, Stewart, and Tingley (2019)
  • First, estimate a topic model (stm())
  • Second, after estimating the topic model:
    • Regression with documents as units of analysis
    • Topic proportions (\(\theta\)) as dependent variables
    • Document-level covariates as predictors (see ?estimateEffect)

Example: Concerns during the COVID-19 crisis

Respondent A

data$OF01_01[1]
[1] "- Das Leben in der Wohngemeinschaft hat sich intensiviert, es wurden dadurch mehr Regeln/Absprachen getroffen, insgesamt erlebe ich das aber als positiv\n- Andere Menschen nicht treffen zu können, belastet mich manchmal. Doch über soziale Medien gibt es ja viele Möglichkeiten, persönliche Treffen zu substituieren, Diese nutze ich derzeit viel häufiger als sonst\n- In meiner Nachbarschaft haben sich einige solidarische Initiativen gegründet, was mich sehr freut und was ich sehr positiv erlebe"
  • “Life in the shared apartment has become more intense, and more rules and agreements have been made as a result, but overall I experience this as positive. Not being able to meet other people sometimes weighs on me, but social media offers many ways to substitute for in-person meetings, and I currently use them much more often than usual. Several solidarity initiatives have formed in my neighborhood, which makes me very happy and which I experience as very positive.”

Respondent A

  • What type of person is this?
data$gender[1]
[1] "female"
data$wohntyp[1]
[1] not alone, no kids
Levels: couple with kids living alone not alone, no kids single parent

Respondent B

data$OF01_01[73]
[1] "Heimarbeit, Haushalt und Kinder sind sehr anstrengend. Die Kosten steigen, Geld fehlt. Macht mich langsam depressiv. "
  • “Working from home, the household, and the children are very exhausting. Costs are rising and money is short. It is slowly making me depressed.”

Respondent B

  • What type of person is this?
data$gender[73]
[1] "female"
data$wohntyp[73]
[1] couple with kids
Levels: couple with kids living alone not alone, no kids single parent

Respondent C

data$OF01_01[51]
[1] "Als Vater in Elternzeit ist die einzige wirkliche Änderung das Ausbleiben von Spieletreffs - mir macht die Wirtschaft am meisten Sorgen und das im Umfeld nun eigentlich vernünftige Personen zu Impf und Virusskeptikern werden"
  • “As a father on parental leave, the only real change is that playgroup meetings no longer take place. What worries me most is the economy, and that otherwise reasonable people around me are turning into vaccine and virus skeptics.”

Respondent C

  • What type of person is this?
data$gender[51]
[1] "male"
data$wohntyp[51]
[1] couple with kids
Levels: couple with kids living alone not alone, no kids single parent

Document-Feature-Matrix

DFM_priv
Document-feature matrix of: 1,119 documents, 3,135 features (99.19% sparse) and 2 docvars.
       features
docs    wohngemeinschaft intensiviert regeln getroff insgesamt erleb positiv
  text1                1            1      1       1         1     2       2
  text2                0            0      0       0         0     0       0
  text3                0            0      0       0         0     0       0
  text4                0            0      0       0         0     0       0
  text5                0            0      0       0         0     0       0
  text6                0            0      0       0         0     0       0
       features
docs    treff belastet manchmal
  text1     2        1        1
  text2     0        0        0
  text3     0        0        0
  text4     1        0        0
  text5     0        0        1
  text6     0        0        0
[ reached max_ndoc ... 1,113 more documents, reached max_nfeat ... 3,125 more features ]
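
To make the structure of a document-feature matrix concrete, the following base R sketch builds one by hand from three made-up, already stemmed toy texts. It only illustrates the idea of counting features per document; DFM_priv itself was created with a dedicated text analysis package, which is not reproduced here.

```r
# Three hypothetical, pre-processed documents (tokens separated by spaces)
texts <- c(text1 = "treff belastet manchmal treff",
           text2 = "sozial treff",
           text3 = "kind schul")

tokens   <- strsplit(texts, " ", fixed = TRUE)
features <- unique(unlist(tokens))

# Count how often each feature occurs in each document
counts <- vapply(tokens,
                 function(tok) as.integer(table(factor(tok, levels = features))),
                 integer(length(features)))

# vapply() returns features x documents, so transpose to documents x features
dfm_toy <- t(counts)
colnames(dfm_toy) <- features
dfm_toy

# Sparsity: share of cells that are zero
sparsity <- mean(dfm_toy == 0)
sparsity
```

With thousands of features and short survey answers, almost all cells are zero, which is why the real matrix above is reported as 99.19% sparse.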

Run Topic Model

  • stm() function
library(stm)

gender_topics <- stm(DFM_priv,
                     K = 8, # Number of topics
                     prevalence = ~ gender, # Covariate
                     data = quanteda::docvars(DFM_priv), # Document-level covariates
                     verbose = FALSE
                     )

Results: \(\beta\)

summary(gender_topics)$prob
A topic model with 8 topics, 1119 documents and a 3135 word dictionary.
     [,1]         [,2]         [,3]         [,4]          [,5]         
[1,] "wirtschaft" "kris"       "corona"     "deutschland" "derzeit"    
[2,] "einkauf"    "haus"       "abstand"    "halt"        "homeoffic"  
[3,] "mutt"       "woch"       "haus"       "besuch"      "geh"        
[4,] "sorg"       "finanziell" "einschrank" "eltern"      "mach"       
[5,] "verhalt"    "sozial"     "umfeld"     "angst"       "hamsterkauf"
[6,] "sozial"     "treff"      "zuhaus"     "fehlt"       "telefoni"   
[7,] "offic"      "hom"        "massnahm"   "wichtig"     "gleichzeit" 
[8,] "kind"       "schul"      "homeoffic"  "positiv"     "halt"       
     [,6]        [,7]      
[1,] "stark"     "sozial"  
[2,] "find"      "sohn"    
[3,] "tocht"     "eltern"  
[4,] "positiv"   "verbring"
[5,] "alt"       "corona"  
[6,] "vermiss"   "allein"  
[7,] "sozial"    "leid"    
[8,] "miteinand" "sozial"  

Topic Prevalence (\(\theta\))

  • How often are these topics addressed?
  • Plot the expected proportion of each topic (i.e., the average of \(\theta\) across documents)
plot(gender_topics)
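
As far as the summary plot is concerned, topic prevalence is essentially the column mean of the document-topic matrix. The sketch below uses a simulated matrix as a stand-in for gender_topics$theta (documents in rows, topics in columns); all numbers are random and purely illustrative.

```r
set.seed(7)
n_docs <- 1119
K <- 8

# Simulated document-topic proportions; each row sums to 1 (stand-in data)
raw   <- matrix(rgamma(n_docs * K, shape = 0.3), nrow = n_docs, ncol = K)
theta <- raw / rowSums(raw)
colnames(theta) <- paste("Topic", 1:K)

# Expected proportion of the corpus belonging to each topic
prevalence <- sort(colMeans(theta), decreasing = TRUE)
round(prevalence, 3)

# Alternative view: number of documents in which each topic is the largest one
dominant <- table(factor(apply(theta, 1, which.max), levels = 1:K))
dominant
```

The two views can differ: a topic may be present in many documents at a low level without ever being the dominant one.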

Naming Topics, Part 1

Topic 1: Economy   Topic 2: Everyday Life   Topic 3: Family   Topic 4: Individual Concerns
economy            shopping                 mom               worry
crisis             house                    week              financial
corona             distance                 house             restriction
germany            stop                     visit             parents
currently          home office              go                do

Naming Topics, Part 2

Topic 5: Society   Topic 6: Contacts   Topic 7: Work   Topic 8: Childcare
behavior           social              office          child
social             meeting             home            school
environment        home                measure         home office
fear               missing             important       positive
panic buying       phone               same time       stop

Validation

Validating Topics

  • Topic models produce probabilities; interpretation is up to the researcher
  • Manual tests: Qualitatively reviewing example texts for interpretability
  • Technical tests: Fit statistics, coherence, exclusivity

Qualitative Validation

Topic Contacts

  • “It makes me sad that I can’t see my relatives. Talking on the phone or skyping is better than nothing, but it doesn’t replace personal contact and a hug.” (woman)

Topic Childcare

  • “Currently I have to take care of three children, a school child, kindergarten child and infant. It’s a balancing act. One has to be homeschooled, the kindergarten child wants to play, the baby still needs a lot of care. That’s stress.” (woman)

Topic Economy

  • “In my opinion, when it comes to weighing up ‘health’ vs. ‘economic requirements’, the medical profession is currently being listened to too much and economists too little.” (man)

Tutorial 07: Exercises 1.-2.

Estimating Differences between Covariates

How are the Topics Distributed Between Men and Women?

  • Regress each topic probability (1:Ntopic) on the document covariate (gender)
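Ntopic <- 8 # Number of topics, matching K in stm() above
gender_diff <- estimateEffect(1:Ntopic ~ gender,
                              gender_topics,
                              metadata = quanteda::docvars(DFM_priv) # Document-level covariates
                              )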
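
Conceptually, this step regresses a topic's proportions on the covariate. The following sketch shows the idea with base R's lm() on simulated data; the variable names and effect sizes are invented for illustration. Note that estimateEffect() additionally accounts for the estimation uncertainty of the topic proportions, which this simplified version ignores.

```r
set.seed(2024)
n <- 400
gender <- factor(sample(c("female", "male"), n, replace = TRUE))

# Hypothetical proportions of one topic per document (values between 0 and 1);
# women are simulated to address this topic more often
theta_topic <- plogis(-1.5 + 0.6 * (gender == "female") + rnorm(n, sd = 0.5))

# Regression with documents as units and the topic proportion as outcome
fit <- lm(theta_topic ~ gender)
coef(fit)

# With a single binary predictor, the slope equals the difference in group means
diff_means <- mean(theta_topic[gender == "male"]) -
  mean(theta_topic[gender == "female"])
diff_means
```

A negative coefficient for "male" here means that men address the simulated topic less often than women, which is the kind of contrast a "difference" plot displays.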

How are the Topics Distributed Between Men and Women?

plot(gender_diff, covariate="gender",
     method  = "difference",
     cov.value1 = "male", cov.value2 = "female"
     )

… With ggplot2

Limitations of Topic Models

  • Results (may) depend heavily on pre-processing decisions
  • Topic labeling is subjective
  • Not all topics may be meaningful
  • Analysis of all topics can be overwhelming
  • Content of topics can be influenced by the number of topics, especially if the number of topics is small
  • Supervised methods are better if categories are predefined

Tutorial 07: Exercises 3.

Quantitative Validation

Choosing the Number of Topics

  • Most important question for topic models: choosing the number of topics in the model (\(K\))
  • Affects interpretability and accuracy of the model:
    • Too few topics: Overly broad themes
    • Too many topics: Overfitting and redundant topics
  • Goal: Find the optimal \(K\) that balances coherence and complexity

searchK()

  • The stm package provides searchK() to help identify a suitable number of topics
  • Evaluates multiple metrics across a range of \(K\):
    • Semantic Coherence: Measures interpretability of topics
    • Held-out Likelihood: Predictive accuracy on unseen data
    • Residuals: Fit of the model to the data
    • Exclusivity: Distinctiveness of topics
  • Trade-offs: Balance coherence, exclusivity, and likelihood
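
Semantic coherence is commonly defined following Mimno et al. (2011): a topic is coherent if its most probable words tend to appear in the same documents. The sketch below implements that definition in base R on a made-up binary document-term matrix; the function name semantic_coherence() and the toy data are illustrative assumptions, not the package's internal code.

```r
# Toy document-term matrix (rows = documents, columns = words); values are made up
dtm <- rbind(
  c(1, 1, 1, 0, 0),
  c(1, 1, 0, 0, 0),
  c(1, 0, 1, 0, 1),
  c(0, 0, 0, 1, 1),
  c(0, 1, 0, 1, 1)
)
colnames(dtm) <- c("kind", "schul", "eltern", "wirtschaft", "sorg")

# Sum over word pairs of log((co-document frequency + 1) / document frequency).
# Every word in `top_words` must occur in at least one document.
semantic_coherence <- function(dtm, top_words) {
  present <- dtm[, top_words, drop = FALSE] > 0
  total <- 0
  for (i in 2:length(top_words)) {
    for (j in 1:(i - 1)) {
      co_docs <- sum(present[, i] & present[, j])  # documents containing both words
      docs_j  <- sum(present[, j])                 # documents containing word j
      total <- total + log((co_docs + 1) / docs_j)
    }
  }
  total
}

coh_family <- semantic_coherence(dtm, c("kind", "schul", "eltern"))
coh_mixed  <- semantic_coherence(dtm, c("kind", "wirtschaft", "sorg"))
c(family = coh_family, mixed = coh_mixed)
```

Values are at most slightly above zero and usually negative; closer to zero means more coherent. In this toy data the word list whose words share documents scores higher than the mixed list.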

Applying searchK()

out <- convert(DFM_priv, to = "stm")

set.seed(1337)
kresult <- searchK(out$documents,
                   out$vocab,
                   K = seq(4, 16, 4), # Candidate numbers of topics: 4, 8, 12, 16
                   prevalence = ~ gender,
                   data = out$meta,
                   verbose = FALSE
                   )

Plotting the Results of searchK()

plot(kresult)

Guidelines for Choosing \(K\)

  • Focus on metrics that align with your goals:
    • High coherence for interpretability
    • High exclusivity for distinct topics
  • Validate \(K\) with domain knowledge and qualitative assessment of topics

Validating Semantic Coherence and Exclusivity of Your Topic Model

topicQuality(model = gender_topics,
             documents = DFM_priv)

Example: Immigration News in Right-Wing Media

  • “To make an informed, inductive choice, we ran models with different numbers of topics, ranging from 10 to 150 (Jacobs and Tschötschel 2019). We then calculated semantic coherence and exclusivity and plotted both against each other (see Figure A1 in the appendix). […] Based on the final model, coherence and exclusivity are relatively high for all topics that are relevant for this study, as Figure A2 in the appendix displays (Table A1 in the appendix shows information for all topics).” (appendix)

Tutorial 07: Exercises 4.